library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(here)
## here() starts at /Users/racquellemangahas/Desktop/stat547_class/project/group_13
library(ggplot2)
library(tidyr)
We found the dataset at: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016/data
This dataset was compiled from four other datasets linked by time and place. It was built to find signals correlated with increased suicide rates among different cohorts globally, across the socio-economic spectrum, with suicide prevention as the motivation. It has 12 columns: country, year, sex, age group, count of suicides, population, suicide rate (per 100k population), country-year composite key, HDI for year, gdp_for_year, gdp_per_capita, and generation (based on age grouping average).
The references for this study are:
United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506
World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#
[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook
World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/
suiciderates <- read.table("suiciderates.csv", sep = " ")
Peek at dataset:
DT::datatable(suiciderates)
Exploratory Data Analysis of ‘suiciderates’
How many rows?
nrow(suiciderates)
## [1] 27820
How many columns?
ncol(suiciderates)
## [1] 12
Summary of suiciderates dataset:
summary(suiciderates)
## country year sex age
## Austria : 382 Min. :1985 female:13910 15-24 years:4642
## Iceland : 382 1st Qu.:1995 male :13910 25-34 years:4642
## Mauritius : 382 Median :2002 35-54 years:4642
## Netherlands: 382 Mean :2001 5-14 years :4610
## Argentina : 372 3rd Qu.:2008 55-74 years:4642
## Belgium : 372 Max. :2016 75+ years :4642
## (Other) :25548
## suicides_no population suicides.100k.pop
## Min. : 0.0 Min. : 278 Min. : 0.00
## 1st Qu.: 3.0 1st Qu.: 97498 1st Qu.: 0.92
## Median : 25.0 Median : 430150 Median : 5.99
## Mean : 242.6 Mean : 1844794 Mean : 12.82
## 3rd Qu.: 131.0 3rd Qu.: 1486143 3rd Qu.: 16.62
## Max. :22338.0 Max. :43805214 Max. :224.97
##
## country.year HDI.for.year gdp_for_year....
## Albania1987: 12 Min. :0.483 Min. :4.692e+07
## Albania1988: 12 1st Qu.:0.713 1st Qu.:8.985e+09
## Albania1989: 12 Median :0.779 Median :4.811e+10
## Albania1992: 12 Mean :0.777 Mean :4.456e+11
## Albania1993: 12 3rd Qu.:0.855 3rd Qu.:2.602e+11
## Albania1994: 12 Max. :0.944 Max. :1.812e+13
## (Other) :27748 NA's :19456
## gdp_per_capita.... generation
## Min. : 251 Boomers :4990
## 1st Qu.: 3447 G.I. Generation:2744
## Median : 9372 Generation X :6408
## Mean : 16866 Generation Z :1470
## 3rd Qu.: 24874 Millenials :5844
## Max. :126352 Silent :6364
##
Figuring out NAs in ‘suiciderates’ dataset:
Out of the entire dataset (27820 observations of 12 variables), what proportion of cells are NAs?
sum(is.na(suiciderates)) / (27820 * 12)
## [1] 0.05827942
For the Human Development Index column (‘HDI for year’), what proportion of values are NAs?
sum(is.na(suiciderates$HDI.for.year)) / 27820
## [1] 0.699353
About 5.8% of all cells in the dataset are NAs, and the variable ‘HDI for year’ is roughly 70% NAs. We have therefore decided to drop that variable from our analyses: with so much data missing, it cannot be used reliably when looking at suicide rates.
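A per-column breakdown makes the source of the missing values explicit. The sketch below is only an illustration on a small made-up data frame (`toy` is hypothetical, not the project data): `colMeans(is.na(...))` gives the share of NAs in each column, and `mean(is.na(...))` gives the share over all cells.

```r
# Toy data frame standing in for `suiciderates` (all values are made up)
toy <- data.frame(
  year         = c(1985, 1986, 1987, 1988),
  suicides_no  = c(10, 12, 9, 11),
  HDI.for.year = c(NA, 0.7, NA, NA)
)

# Share of missing values in each column
na_by_column <- colMeans(is.na(toy))
print(na_by_column)

# Share of missing values over all cells (rows x columns)
na_overall <- mean(is.na(toy))
print(na_overall)
```

Applied to the real data, the same two calls would show at a glance whether any column other than ‘HDI for year’ contains NAs.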
Removing NAs and creating refined ‘suicideratesnew’ dataset:
Next, we select only the variables we are interested in, which removes ‘HDI for year’.
suicideratesnew <- suiciderates %>%
  select(-HDI.for.year)
DT::datatable(suicideratesnew)
We now check what proportion of NAs remains in this dataset:
sum(is.na(suicideratesnew)) / (27820 * 11)
## [1] 0
The new dataset contains no NAs, which confirms that ‘HDI for year’ accounted for all of them.
Exploratory Data Analysis of ‘suicideratesnew’:
How many rows?
nrow(suicideratesnew)
## [1] 27820
How many columns?
ncol(suicideratesnew)
## [1] 11
Summary of suicideratesnew dataset:
summary(suicideratesnew)
## country year sex age
## Austria : 382 Min. :1985 female:13910 15-24 years:4642
## Iceland : 382 1st Qu.:1995 male :13910 25-34 years:4642
## Mauritius : 382 Median :2002 35-54 years:4642
## Netherlands: 382 Mean :2001 5-14 years :4610
## Argentina : 372 3rd Qu.:2008 55-74 years:4642
## Belgium : 372 Max. :2016 75+ years :4642
## (Other) :25548
## suicides_no population suicides.100k.pop
## Min. : 0.0 Min. : 278 Min. : 0.00
## 1st Qu.: 3.0 1st Qu.: 97498 1st Qu.: 0.92
## Median : 25.0 Median : 430150 Median : 5.99
## Mean : 242.6 Mean : 1844794 Mean : 12.82
## 3rd Qu.: 131.0 3rd Qu.: 1486143 3rd Qu.: 16.62
## Max. :22338.0 Max. :43805214 Max. :224.97
##
## country.year gdp_for_year.... gdp_per_capita....
## Albania1987: 12 Min. :4.692e+07 Min. : 251
## Albania1988: 12 1st Qu.:8.985e+09 1st Qu.: 3447
## Albania1989: 12 Median :4.811e+10 Median : 9372
## Albania1992: 12 Mean :4.456e+11 Mean : 16866
## Albania1993: 12 3rd Qu.:2.602e+11 3rd Qu.: 24874
## Albania1994: 12 Max. :1.812e+13 Max. :126352
## (Other) :27748
## generation
## Boomers :4990
## G.I. Generation:2744
## Generation X :6408
## Generation Z :1470
## Millenials :5844
## Silent :6364
##
Plots
In this first plot, we will look at how suicides may differ between generations, globally between 1985-2016.
gen_suicides <- suicideratesnew %>%
  group_by(generation) %>%
  summarise(mean_suicides = mean(suicides_no))
DT::datatable(gen_suicides)
gen_suicides %>%
  ggplot() +
  geom_col(aes(x = fct_reorder(generation, mean_suicides), y = mean_suicides, fill = generation)) +
  xlab("Generation") +
  ylab("Mean # of suicides") +
  theme_minimal() +
  coord_flip() +
  ggtitle("Average number of suicides globally across generations (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))
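Note that the mean of `suicides_no` averages raw counts over rows (one row per country, year, sex and age group), so it depends on population size. One possible complement, sketched here with made-up numbers (the data frame `toy` and the column name `rate_per_100k` are hypothetical), is a population-adjusted rate per generation:

```r
library(dplyr)

# Toy rows standing in for `suicideratesnew` (all values are made up)
toy <- data.frame(
  generation  = c("Boomers", "Boomers", "Silent", "Silent"),
  suicides_no = c(10, 30, 5, 15),
  population  = c(100000, 300000, 20000, 80000)
)

gen_rates <- toy %>%
  group_by(generation) %>%
  summarise(
    # Mean raw count per row, as in the plot above
    mean_suicides = mean(suicides_no),
    # Suicides per 100k population, pooled over all rows in the group
    rate_per_100k = sum(suicides_no) / sum(population) * 1e5
  )
print(gen_rates)
```

In this toy example the group with the larger mean count has the lower rate per 100k, which is why the two summaries can rank generations differently.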
In the second plot, we look at how the total number of suicides per year has changed over time in Canada, to see whether there is a trend.
canada_suicides <- suicideratesnew %>%
  filter(country == "Canada") %>%
  group_by(year) %>%
  summarise(sum_suicides = sum(suicides_no))
DT::datatable(canada_suicides)
canada_suicides %>%
  ggplot() +
  geom_line(aes(x = year, y = sum_suicides)) +
  xlab("Year") +
  ylab("Sum of suicides") +
  theme_minimal() +
  ggtitle("Number of suicides in Canada (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))
Lastly, we will see the distribution of suicides between sexes within the entire dataset.
suicideratesnew %>%
  ggplot() +
  geom_violin(aes(x = sex, y = suicides_no, fill = sex)) +
  xlab("Sex") +
  ylab("Number of suicides") +
  theme_minimal() +
  ggtitle("Distribution of suicides between sexes, globally (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))
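The summary above shows that `suicides_no` is strongly right-skewed (median 25, maximum 22338), so a violin on the raw scale will be dominated by a few large values. As one possible variation, and only a sketch on made-up data (`toy` is hypothetical), the counts can be transformed with `log1p()` before plotting, which also handles the zero counts:

```r
library(ggplot2)

# Toy counts standing in for `suicideratesnew` (all values are made up)
toy <- data.frame(
  sex         = rep(c("female", "male"), each = 5),
  suicides_no = c(0, 2, 10, 40, 900, 0, 5, 30, 150, 4000)
)

# log1p(x) = log(1 + x), so rows with zero suicides stay finite
toy$log_suicides <- log1p(toy$suicides_no)

p <- ggplot(toy) +
  geom_violin(aes(x = sex, y = log_suicides, fill = sex)) +
  xlab("Sex") +
  ylab("log(1 + number of suicides)") +
  theme_minimal()
print(p)
```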
Research Question
Between 1985 and 2016, how did suicide rates differ between sexes and generations, and are they significantly correlated with each country's GDP per capita?
How?
With our research question, we are interested in how suicide rates differ among sexes and generations. We will then perform a linear regression analysis on the variables of interest and, if it indicates a relationship between them, plot those variables together with the fitted regression line.
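The planned regression step could look roughly like the sketch below. It is an illustration under assumptions, not the final analysis: the data frame `toy` and its values are made up, and the column names `gdp_per_capita` and `suicide_rate` are simplified stand-ins for the dataset's `gdp_per_capita....` and `suicides.100k.pop` columns.

```r
library(ggplot2)

# Made-up values; column names are simplified stand-ins for the real ones
toy <- data.frame(
  gdp_per_capita = c(1000, 5000, 10000, 20000, 40000, 60000),
  suicide_rate   = c(6.1, 8.4, 9.0, 12.3, 11.8, 14.2)
)

# Simple linear regression of suicide rate on GDP per capita
fit <- lm(suicide_rate ~ gdp_per_capita, data = toy)
print(summary(fit)$coefficients)

# Pearson correlation test between the two variables
print(cor.test(toy$gdp_per_capita, toy$suicide_rate))

# Scatter plot with the fitted regression line
p <- ggplot(toy, aes(x = gdp_per_capita, y = suicide_rate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("GDP per capita") +
  ylab("Suicides per 100k population") +
  theme_minimal()
print(p)
```

To address the sex and generation part of the question, the same model formula could be extended with those variables as predictors, for example `suicide_rate ~ gdp_per_capita + sex + generation`.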